10 research outputs found
Recommended from our members
A Nearest-Neighbor Approach to Indicative Web Summarization
Through their role of content proxy, in particular on search engine result pages, Web summaries play an essential part in the discovery of information and services on the Web. In their simplest form, Web summaries are snippets based on a user-query and are obtained by extracting from the content of Web pages. The focus of this work, however, is on indicative Web summarization, that is, on the generation of summaries describing the purpose, topics and functionalities of Web pages. In many scenarios — e.g. navigational queries or content-deprived pages — such summaries represent a valuable commodity to concisely describe Web pages while circumventing the need to produce snippets from inherently noisy, dynamic, and structurally complex content. Previous approaches have identified linking pages as a privileged source of indicative content from which Web summaries may be derived using traditional extractive methods. To be reliable, these approaches require sufficient anchortext redundancy, ultimately showing the limits of extractive algorithms for what is, fundamentally, an abstractive task. In contrast, we explore the viability of abstractive approaches and propose a nearest-neighbors summarization framework leveraging summaries of conceptually related (neighboring) Web pages. We examine the steps that can lead to the reuse and adaptation of existing summaries to previously unseen pages. Specifically, we evaluate two Text-to-Text transformations that cover the main types of operations applicable to neighbor summaries: (1) ranking, to identify neighbor summaries that best fit the target; (2) target adaptation, to adjust individual neighbor summaries to the target page based on neighborhood-specific template-slot models. For this last transformation, we report on an initial exploration of the use of slot-driven compression to adjust adapted summaries based on the confidence associated with token-level adaptation operations. Overall, this dissertation explores a new research avenue for indicative Web summarization and shows the potential value, given the diversity and complexity of the content of Web pages, of transferring, and, when necessary, of adapting, existing summary information between conceptually similar Web pages
A Hierarchical Model of Web Summaries
We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
Enabling Interoperability For Autonomous Digital Libraries : An API To CiteSeer Services
We introduce CiteSeer-API, a public API to CiteSeer-like services. CiteSeer-API is SOAP/WSDL based and allows for easy programatical access to all the specific functionalities offered by CiteSeer services, including full text search of documents and citations and citation-based document discovery. CiteSeer-API is currently showcased on SMEALSearch [10], a digital library search engine for business academic publications
A Service-Oriented Architecture for Digital Libraries
CiteSeer is currently a very large source of meta-data information on the World Wide Web (WWW). This meta-data is the key material for the Semantic Web. Still, CiteSeer is not yet a Semantic-enabled service and therefore its meta-data, although potentially usable by Semantic Web agents, is not yet reachable using the Semantic Web mechanisms. The complexity of CiteSeer, that is the range of tasks it supports, make the transition to a Semantic-enabled service a non-trivial task. While human users tend to perceive CiteSeer as a single well-integrated service, we believe it is best seen -- from a machine perspective -- as a collection of services, each service performing a specific task. In this paper we show our approach to enable CiteSeer on the Semantic Web in order to allow the use of its meta-data through the Semantic Web. We first introduce an intuitive Application Programming Interface (API) to the CiteSeer software, then show that an efficient integration of CiteSeer in the Semantic Web can be best achieved by independently integrating the services that comprise it. We believe the effort presented here towards the Semantic-integration of a complex Information Retrieval system could be used as an integration model for arbitrary systems
Citeseer-api: towards seamless resource location and interlinking for digital libraries
We introduce CiteSeer-API, a public API to CiteSeer-like services. CiteSeer-API is SOAP/WSDL based and allows for easy programmatical access to all the specific functionalities offered by CiteSeer services, including full text search of documents and citations and citation-based document discovery. In order to enable operability and interlinking with arbitrary software agents and digital library systems, CiteSeer-API uses digital content signatures to create system-independent handles for the Document, Citation and Group resources of CiteSeer servers. We discuss specific functionalities of CiteSeer-API that take advantage of these handlers in order to enable seamless location of CiteSeer resources. Finally we argue that the digital signature scheme used by CiteSeer-API is well suited for the creation of machine-usable semantic descriptions of digital library services which is the key toward seamless discovery and integration of services such as CiteSeer-API. CiteSeer-API is currently showcased on CiteSeer.IST, the CiteSeer server of the School o
eBizSearch: A Niche Search Engine for e-Business
Niche Search Engines offer an efficient alternative to traditional search engines when the results returned by general-purpose search engines do not provide a sufficient degree of relevance. By taking advantage of their domain of concentration they achieve higher relevance and offer enhanced features. We discuss a new niche search engine, eBizSearch, based on the technology of CiteSeer and dedicated to e-business and e-business documents. We present the integration of CiteSeer in the framework of eBizSearch and the process necessary to tune the whole system towards the specific area of e-business. We also discuss how using machine learning algorithms we generate metadata to make eBizSearch Open Archives compliant. eBizSearch is a publicly available service and can be reached at [3]